Abstract

Stochastic multi-armed bandits solve the Exploration-Exploitation dilemma and ultimately
maximize the expected reward. Nonetheless, in many practical problems, maximizing the expected
reward is not the most desirable objective. In this paper, we introduce a novel setting based on
the principle of risk-aversion where the objective is to compete against the arm with the best
risk-return trade-off. This setting proves to be intrinsically more difficult than the standard
multi-arm bandit setting due in part to an exploration risk which introduces a regret associated
to the variability of an algorithm. Using variance as a measure of risk, we introduce two new
algorithms, investigate their theoretical guarantees, and report preliminary empirical results.

1 Introduction

The multi-armed bandit [?] elegantly formalizes the problem of on-line learning with partial
feedback, which encompasses a large number of real-world applications, such as clinical trials,
online advertisements, adaptive routing, and cognitive radio. In the stochastic multi-armed bandit
model, a learner chooses among several arms (e.g., different treatments), each characterized by an
independent reward distribution (e.g., the treatment effectiveness). At each point in time, the
learner selects one arm and receives a noisy reward observation from that arm (e.g., the effect of
the treatment on one patient). Given a finite number of n rounds (e.g., patients involved in the
clinical trial), the learner faces a dilemma between repeatedly exploring all arms and collecting
reward information versus exploiting current reward estimates by selecting the arm with the highest
estimated reward. Roughly speaking, the learning objective is to solve this exploration-exploitation
dilemma and accumulate as much reward as possible over n rounds. In particular, multi-arm bandit
literature typically focuses on the problem of finding a learning algorithm capable of maximizing
the expected cumulative reward (i.e., the reward collected over n rounds averaged over all possible
observation realizations), thus implying that the best arm returns the highest expected reward.
Nonetheless, in many practical problems, maximizing the expected reward is not always the most
desirable objective. For instance, in clinical trials, the treatment which works best on average
might also have considerable variability, resulting in adverse side effects for some patients. In
this case, a treatment which is less effective on average but consistently effective on different
patients may be preferable to an effective but risky treatment. More generally, some application
objectives require an effective trade-off between risk and reward.

There is no agreed upon definition for risk. A variety of behaviours result in an uncertainty which
might be deemed unfavourable for a specific application and referred to as a risk. For example,
a solution with guarantees over multiple runs of an algorithm may not satisfy the desire for a
solution with low variability over a single implementation of an algorithm. Two foundational risk
modeling paradigms are Expected Utility theory [12] and the historically popular and accessible
Mean-Variance paradigm [10]. A large part of decision-making theory focuses on defining and
managing risk (see e.g., [9] for an introduction to risk from an expected utility theory perspective).

Risk has mostly been studied in on-line learning within the so-called expert advice setting (i.e.,
adversarial full-information on-line learning). In particular, [8] showed that in general, although it
is possible to achieve a small regret w.r.t. the expert with the best average performance, it is not
possible to compete against the expert which best trades off between average return and risk. On
the other hand, it is possible to define no-regret algorithms for simplified measures of risk-return.
[?] studied the case of pure risk minimization (notably variance minimization) in an on-line setting
where at each step the learner is given a covariance matrix and must choose a weight vector that
minimizes the variance. The regret is then computed over horizon n and compared to the fixed
weights minimizing the variance in hindsight. In the multi-arm bandit domain, the most interesting
results are by [5] and [14]. [5] introduced an analysis of the expected regret and its distribution,
revealing that an anytime version of UCB [6] and UCB-V might have large regret with some
non-negligible probability.¹ This analysis is further extended by [14], who derived negative results
which show that no anytime algorithm can achieve a regret with both a small expected regret and
exponential tails. Although these results represent an important step towards the analysis of risk
within bandit algorithms, they are limited to the case where an algorithm's cumulative reward is
compared to the reward obtained by pulling the arm with the highest expectation.

In this paper, we focus on the problem of competing against the arm with the best risk-return
trade-off. In particular, we refer to the first and most popular measure of risk-return, the
mean-variance model introduced by [10]. In Section 2 we introduce notation and define the
mean-variance bandit problem. In Section 3 we introduce a confidence-bound algorithm and study its
theoretical properties. In Section 5 we report a set of numerical simulations aiming at validating
the theoretical results. Finally, in Section 7 we conclude with a discussion on possible extensions.
The proofs and additional experiments are reported in the appendix.

2 Mean-Variance Multi-arm Bandit

In this section we introduce the main notation used throughout the paper and define the
mean-variance multi-arm bandit problem.

We consider the standard multi-arm bandit setting with $K$ arms, each characterized by a
distribution $\nu_i$ bounded in the interval $[0, 1]$. Each distribution has a mean $\mu_i$ and a
variance $\sigma_i^2$. The bandit problem is defined over a finite horizon of $n$ rounds. We denote
by $X_{i,s} \sim \nu_i$ the $s$-th random sample drawn from the distribution of arm $i$. All arms
and samples are independent. In the multi-arm bandit protocol, at each round $t$, an algorithm
selects arm $I_t$ and observes sample $X_{I_t, T_{I_t,t}}$, where $T_{i,t}$ is the number of
samples observed from arm $i$ up to time $t$ (i.e., $T_{i,t} = \sum_{s=1}^{t} \mathbb{I}\{I_s = i\}$).

While in the standard literature on multi-armed bandits the objective is to select the arm leading to
the highest reward in expectation (the arm with the largest expected value $\mu_i$), here we focus
on the problem of finding the arm which effectively trades off between its expected reward (i.e.,
the return) and its variability (i.e., the risk). Although a large number of models for risk-return
trade-off have been proposed, here we focus on the most historically popular and simple model: the
mean-variance model proposed by [10],² where the return of an arm is measured by the expected
reward and its risk by its variance.

Definition 1. The mean-variance of an arm $i$ with mean $\mu_i$, variance $\sigma_i^2$ and
coefficient of absolute risk tolerance³ $\rho$ is defined as $\mathrm{MV}_i = \sigma_i^2 - \rho\mu_i$.

Thus it easily follows that the best arm minimizes the mean-variance, that is
$i^* = \arg\min_{i=1,\dots,K} \mathrm{MV}_i$. We notice that we can obtain two extreme settings
depending on the value of risk tolerance $\rho$. As $\rho \to \infty$, the mean-variance of arm $i$
tends to the opposite of its expected value $\mu_i$ and the problem reduces to the standard
expected reward maximization traditionally considered in multi-arm bandit problems. With
$\rho = 0$, the mean-variance reduces to the variance $\sigma_i^2$ and the objective becomes
variance minimization.
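
The effect of the risk tolerance on the choice of the best arm can be illustrated with a short Python sketch. It assumes the definition $\mathrm{MV}_i = \sigma_i^2 - \rho\mu_i$ from Definition 1; the function names and the two-arm instance are hypothetical and chosen only for illustration.

```python
def mean_variance(mean, variance, rho):
    """Mean-variance of an arm: variance - rho * mean (lower is better)."""
    return variance - rho * mean


def best_arm(means, variances, rho):
    """Index of the arm with the smallest mean-variance."""
    scores = [mean_variance(m, v, rho) for m, v in zip(means, variances)]
    return min(range(len(scores)), key=scores.__getitem__)


# Hypothetical instance: arm 0 is safe, arm 1 has a higher mean but is riskier.
means = [0.5, 0.7]
variances = [0.01, 0.20]

safe_choice = best_arm(means, variances, rho=0.0)      # pure variance minimization
greedy_choice = best_arm(means, variances, rho=100.0)  # close to mean maximization
print(safe_choice, greedy_choice)
```

With these numbers the two arms swap roles near $\rho = 0.95$, where $0.01 - 0.5\rho = 0.20 - 0.7\rho$: below that value the low-variance arm is optimal, above it the high-mean arm is.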



1 Although the analysis is mostly directed to the pseudo-regret, as commented in Remark 2 at page 23 of
[5], it can be extended to the true regret.

2 We discuss the limitations of this model and possible extensions to other models of risk in Section 7.
3 The coefficient of risk tolerance is the inverse of the more popular coefficient of risk aversion $A = 1/\rho$.

Given $\{X_{i,s}\}_{s=1}^{t}$ i.i.d. samples from the distribution $\nu_i$, we define the empirical
mean-variance of an arm $i$ with $t$ samples as
$\widehat{\mathrm{MV}}_{i,t} = \hat\sigma_{i,t}^2 - \rho\hat\mu_{i,t}$, where

$$\hat\mu_{i,t} = \frac{1}{t}\sum_{s=1}^{t} X_{i,s}, \qquad \hat\sigma_{i,t}^2 = \frac{1}{t}\sum_{s=1}^{t}\big(X_{i,s} - \hat\mu_{i,t}\big)^2. \tag{1}$$



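We now consider a learning algorithm $\mathcal{A}$ and its corresponding performance over $n$
rounds. Similar to a single arm $i$ we define its empirical mean-variance as

$$\widehat{\mathrm{MV}}_n(\mathcal{A}) = \hat\sigma_n^2(\mathcal{A}) - \rho\hat\mu_n(\mathcal{A}), \tag{2}$$

where

$$\hat\mu_n(\mathcal{A}) = \frac{1}{n}\sum_{t=1}^{n} Z_t, \qquad \hat\sigma_n^2(\mathcal{A}) = \frac{1}{n}\sum_{t=1}^{n}\big(Z_t - \hat\mu_n(\mathcal{A})\big)^2, \tag{3}$$

with $Z_t = X_{I_t, T_{I_t,t}}$, that is the reward collected by the algorithm at time $t$. This
leads to a natural definition of the (random) regret at each single run of the algorithm as the
difference in the mean-variance performance of the algorithm compared to the best arm.

Definition 2. The regret for a learning algorithm $\mathcal{A}$ over $n$ rounds is defined as

$$\mathcal{R}_n(\mathcal{A}) = \widehat{\mathrm{MV}}_n(\mathcal{A}) - \widehat{\mathrm{MV}}_{i^*,n}. \tag{4}$$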

Given this definition, the objective is to design an algorithm whose regret decreases as the number 
of rounds increases (in high probability or in expectation). 
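
A minimal Python sketch of these quantities may help fix ideas. It assumes the $1/t$-normalized empirical mean and variance and the regret of Definition 2 (mean-variance of the rewards collected by the algorithm minus the mean-variance of $n$ samples of the best arm); the function names and the toy data are hypothetical.

```python
def empirical_mv(samples, rho):
    """Empirical mean-variance: (1/t)-normalized variance minus rho times the mean."""
    t = len(samples)
    mean = sum(samples) / t
    var = sum((x - mean) ** 2 for x in samples) / t
    return var - rho * mean


def mv_regret(rewards, best_arm_samples, rho):
    """Regret of a single run: MV of the collected rewards minus MV of the best arm."""
    if len(rewards) != len(best_arm_samples):
        raise ValueError("both sequences must contain n samples")
    return empirical_mv(rewards, rho) - empirical_mv(best_arm_samples, rho)


# Hypothetical run: the algorithm alternates between rewards 0 and 1,
# while the best arm would have returned 1 at every round.
rewards = [0.0, 1.0, 0.0, 1.0]
best = [1.0, 1.0, 1.0, 1.0]
print(mv_regret(rewards, best, rho=0.0))
print(mv_regret(rewards, best, rho=1.0))
```

In this toy run the collected rewards have mean 0.5 and variance 0.25, so the regret is 0.25 for $\rho = 0$ and $(0.25 - 0.5) - (0 - 1) = 0.75$ for $\rho = 1$.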

We notice that the previous definition actually depends on unobserved samples. In fact,
$\widehat{\mathrm{MV}}_{i^*,n}$ is computed on $n$ samples from $i^*$ which are not actually
observed when running $\mathcal{A}$. This matches the definition of true regret in standard bandits
(see e.g., [?]). Thus, in order to clarify the main components characterizing the regret, we
introduce additional notation. Let

$$Y_{i,t} = \begin{cases} X_{i^*,t} & \text{if } i = i^* \\ X_{i^*,t'} \ \text{with}\ t' = T_{i^*,n} + \sum_{j < i,\, j \ne i^*} T_{j,n} + t & \text{otherwise} \end{cases}$$

be a renaming of the samples from the optimal arm, such that while the algorithm was pulling arm
$i$ for the $t$-th time, $Y_{i,t}$ is the unobserved sample from $i^*$. Then we define the
corresponding mean and variance as

$$\tilde\mu_{i,T_{i,n}} = \frac{1}{T_{i,n}}\sum_{t=1}^{T_{i,n}} Y_{i,t}, \qquad \tilde\sigma_{i,T_{i,n}}^2 = \frac{1}{T_{i,n}}\sum_{t=1}^{T_{i,n}}\big(Y_{i,t} - \tilde\mu_{i,T_{i,n}}\big)^2. \tag{5}$$

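Given these additional definitions, we can rewrite the regret as (see Appendix A.1)

$$\mathcal{R}_n(\mathcal{A}) = \frac{1}{n}\sum_{i \ne i^*} T_{i,n}\Big[\big(\hat\sigma_{i,T_{i,n}}^2 - \rho\hat\mu_{i,T_{i,n}}\big) - \big(\tilde\sigma_{i,T_{i,n}}^2 - \rho\tilde\mu_{i,T_{i,n}}\big)\Big]
+ \frac{1}{n}\sum_{i=1}^{K} T_{i,n}\big(\hat\mu_{i,T_{i,n}} - \hat\mu_n(\mathcal{A})\big)^2
- \frac{1}{n}\sum_{i=1}^{K} T_{i,n}\big(\tilde\mu_{i,T_{i,n}} - \hat\mu_{i^*,n}\big)^2. \tag{6}$$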

Since the last term is always negative and small,⁴ our analysis focuses on the first two terms,
which reveal two interesting characteristics of $\mathcal{A}$. First, an algorithm $\mathcal{A}$
suffers a regret whenever it chooses a suboptimal arm $i \ne i^*$ and the regret corresponds to the
difference in the empirical mean-variance of $i$ w.r.t. the optimal arm $i^*$. Such a definition
has a strong similarity to the standard definition of regret, where $i^*$ is the arm with highest
expected value and the regret depends on the number of times suboptimal arms are pulled and their
respective gaps w.r.t. the optimal arm $i^*$. In contrast to the standard formulation of regret,
$\mathcal{A}$ also suffers an additional regret from the variance $\hat\sigma_n^2(\mathcal{A})$,
which depends on the variability of pulls $T_{i,n}$ over different arms. Recalling the definition
of the mean $\hat\mu_n(\mathcal{A})$ as the weighted mean of the empirical means
$\hat\mu_{i,T_{i,n}}$ with weights $T_{i,n}/n$ (see eq. 3), we notice that this second term is a
weighted variance of the means and illustrates the exploration risk of the algorithm. In fact, if
an algorithm simply selects and pulls a single arm from the beginning, it would not suffer any
exploration risk (secondary regret) since $\hat\mu_n(\mathcal{A})$ would coincide with
$\hat\mu_{i,T_{i,n}}$ for the chosen arm and all other components would have zero weight. On the
other hand, an algorithm accumulates exploration risk through this second term as the mean
$\hat\mu_n(\mathcal{A})$ deviates from any specific arm, where the maximum exploration risk peaks
at the mean $\hat\mu_n(\mathcal{A})$ furthest from all arm means.

4 More precisely, it can be shown that this term decreases with rate $O(K \log(1/\delta)/n)$ with probability $1 - \delta$.
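
The exploration risk can be checked numerically. The sketch below relies on the standard decomposition of the variance of a pooled sample into a within-group part and a between-group part (the weighted variance of the per-arm means); the helper names and the reward lists are hypothetical.

```python
def mean(xs):
    return sum(xs) / len(xs)


def var(xs):
    """(1/t)-normalized variance, matching the empirical variance used in the text."""
    m = mean(xs)
    return sum((x - m) ** 2 for x in xs) / len(xs)


def decompose(per_arm_rewards):
    """Split the variance of the pooled rewards into within-arm and between-arm parts.

    per_arm_rewards[i] holds the (non-empty) list of rewards collected from arm i.
    """
    n = sum(len(r) for r in per_arm_rewards)
    pooled = [x for r in per_arm_rewards for x in r]
    overall_mean = mean(pooled)
    within = sum(len(r) * var(r) for r in per_arm_rewards) / n
    # Weighted variance of the arm means: the exploration risk term.
    between = sum(len(r) * (mean(r) - overall_mean) ** 2 for r in per_arm_rewards) / n
    return var(pooled), within, between


total, within, between = decompose([[0.2, 0.4], [0.9, 0.7, 0.8]])
print(total, within + between)
```

The two printed values agree up to floating-point error, and the between-arm part vanishes when all rewards come from a single arm, which is the case of no exploration risk described above.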

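The previous definition of regret can be further elaborated to obtain the upper bound (see App. A.1)

$$\mathcal{R}_n(\mathcal{A}) \le \frac{1}{n}\sum_{i \ne i^*} T_{i,n}\hat\Delta_i + \frac{1}{n^2}\sum_{i=1}^{K}\sum_{j \ne i} T_{i,n}T_{j,n}\hat\Gamma_{i,j}^2, \tag{7}$$

where $\hat\Delta_i = \big(\hat\sigma_{i,T_{i,n}}^2 - \tilde\sigma_{i,T_{i,n}}^2\big) - \rho\big(\hat\mu_{i,T_{i,n}} - \tilde\mu_{i,T_{i,n}}\big)$
and $\hat\Gamma_{i,j}^2 = \big(\hat\mu_{i,T_{i,n}} - \hat\mu_{j,T_{j,n}}\big)^2$. Unlike the
definition in eq. 6, this upper bound explicitly illustrates the relationship between the regret
and the number of pulls $T_{i,n}$, suggesting that a bound on the pulls is sufficient to bound the
regret.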

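Finally, we can also introduce a definition of the pseudo-regret.

Definition 3. The pseudo regret for a learning algorithm $\mathcal{A}$ over $n$ rounds is defined as

$$\widetilde{\mathcal{R}}_n(\mathcal{A}) = \frac{1}{n}\sum_{i \ne i^*} T_{i,n}\Delta_i + \frac{2}{n^2}\sum_{i=1}^{K}\sum_{j \ne i} T_{i,n}T_{j,n}\Gamma_{i,j}^2, \tag{8}$$

where $\Delta_i = \mathrm{MV}_i - \mathrm{MV}_{i^*}$ and $\Gamma_{i,j} = \mu_i - \mu_j$.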

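In the following, we denote the two components of the pseudo-regret as

$$\widetilde{\mathcal{R}}_n^{\Delta}(\mathcal{A}) = \frac{1}{n}\sum_{i \ne i^*} T_{i,n}\Delta_i \quad\text{and}\quad \widetilde{\mathcal{R}}_n^{\Gamma}(\mathcal{A}) = \frac{2}{n^2}\sum_{i=1}^{K}\sum_{j \ne i} T_{i,n}T_{j,n}\Gamma_{i,j}^2, \tag{9}$$

where $\widetilde{\mathcal{R}}_n^{\Delta}(\mathcal{A})$ constitutes the standard regret derived
from the traditional formulation of the multi-arm bandit problem and
$\widetilde{\mathcal{R}}_n^{\Gamma}(\mathcal{A})$ denotes the exploration risk. This regret can be
shown to be close to the true regret up to small terms with high probability.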
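
The two components can be computed directly from the number of pulls. The following Python sketch assumes the form of the pseudo-regret read from Definition 3, namely a gap term $\frac{1}{n}\sum_{i \ne i^*} T_{i,n}\Delta_i$ and an exploration-risk term $\frac{2}{n^2}\sum_i\sum_{j \ne i} T_{i,n}T_{j,n}\Gamma_{i,j}^2$; the function name and the numerical instances are hypothetical.

```python
def pseudo_regret(pulls, means, variances, rho):
    """Return the (gap term, exploration-risk term) of the pseudo-regret.

    pulls[i] is the number of times arm i was pulled over n = sum(pulls) rounds.
    """
    n = sum(pulls)
    K = len(pulls)
    mv = [v - rho * m for m, v in zip(means, variances)]
    best = min(mv)
    # Gap term: pulls of each arm weighted by its mean-variance gap.
    gap_term = sum(pulls[i] * (mv[i] - best) for i in range(K)) / n
    # Exploration risk: pairs of distinct arms weighted by their squared mean difference.
    risk_term = 2.0 / n ** 2 * sum(
        pulls[i] * pulls[j] * (means[i] - means[j]) ** 2
        for i in range(K) for j in range(K) if j != i
    )
    return gap_term, risk_term


# Equal variances, different means: no gap regret, but a positive exploration risk.
print(pseudo_regret([50, 50], [0.2, 0.8], [0.0, 0.0], rho=0.0))
# Equal means, different variances: gap regret only.
print(pseudo_regret([90, 10], [0.5, 0.5], [0.05, 0.25], rho=0.0))
```

The first call isolates the exploration risk (the gap term is zero because both arms have the same mean-variance), while the second isolates the gap term (the arms share the same mean).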

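Lemma 1. Given Definitions 2 and 3,

$$\mathcal{R}_n(\mathcal{A}) \le \widetilde{\mathcal{R}}_n(\mathcal{A}) + [\text{small terms; explicit expression missing in the source}]$$

with probability at least $1 - \delta$.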

The previous lemma shows that any (high-probability) bound on the pseudo-regret immediately
translates into a bound on the true regret. Thus, we report most of the theoretical analysis
according to $\widetilde{\mathcal{R}}_n(\mathcal{A})$. Nonetheless, it is interesting to notice the
major difference between the true and pseudo-regret when compared to the standard bandit problem.
In fact, it is possible to show in the risk-averse case that the pseudo-regret is not an unbiased
estimator of the true regret, i.e., $\mathbb{E}[\widetilde{\mathcal{R}}_n] \ne \mathbb{E}[\mathcal{R}_n]$.
Thus, in order to bound the expectation of $\mathcal{R}_n$ we build on the high-probability result
from Lemma 1.



3 The Mean-Variance Lower Confidence Bound Algorithm

In this section we introduce a novel risk-averse bandit algorithm whose objective is to identify the
arm which best trades off risk and return. The algorithm is a natural extension of UCB1 [6] and we
report a theoretical performance analysis on how well it balances the exploration needed to identify
the best arm versus the risk of pulling arms with different means.

Input: Confidence $\delta$
for $t = 1, \dots, n$ do
    for $i = 1, \dots, K$ do
        Compute $B_{i,T_{i,t-1}} = \widehat{\mathrm{MV}}_{i,T_{i,t-1}} - (5 + \rho)\sqrt{\frac{\log 1/\delta}{2T_{i,t-1}}}$
    end for
    Return $I_t = \arg\min_{i=1,\dots,K} B_{i,T_{i,t-1}}$
    Update $T_{I_t,t} = T_{I_t,t-1} + 1$
    Observe $X_{I_t,T_{I_t,t}} \sim \nu_{I_t}$
    Update $\widehat{\mathrm{MV}}_{I_t,T_{I_t,t}}$
end for

Figure 1: Pseudo-code of the MV-LCB algorithm.



3.1 The Algorithm

We propose an index-based bandit algorithm which estimates the mean-variance of each arm and
selects the optimal arm according to the optimistic confidence-bounds on the current estimates. A
sketch of the algorithm is reported in Figure 1. For each arm, the algorithm keeps track of the
empirical mean-variance $\widehat{\mathrm{MV}}_{i,s}$ computed according to $s$ samples. We can
build high-probability confidence bounds on the empirical mean-variance through an application of
the Chernoff-Hoeffding inequality (see e.g., [?] for the bound on the variance) on terms
$\hat\mu$ and $\hat\sigma^2$.

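Lemma 2. Let $\{X_{i,s}\}$ be i.i.d. random variables bounded in $[0, 1]$ from the distribution
$\nu_i$ with mean $\mu_i$ and variance $\sigma_i^2$, and the empirical mean $\hat\mu_{i,s}$ and
variance $\hat\sigma_{i,s}^2$ computed as in Equation 1, then

$$\mathbb{P}\left[\exists i = 1,\dots,K,\ s = 1,\dots,n,\ \big|\widehat{\mathrm{MV}}_{i,s} - \mathrm{MV}_i\big| \ge (5 + \rho)\sqrt{\frac{\log 1/\delta}{2s}}\right] \le 6nK\delta.$$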

The algorithm in Figure 1 implements the principle of optimism in the face of uncertainty used in
many multi-arm bandit algorithms. On the basis of the previous confidence bounds, we define a
lower-confidence bound on the mean-variance of arm $i$ when it has been pulled $s$ times as

$$B_{i,s} = \widehat{\mathrm{MV}}_{i,s} - (5 + \rho)\sqrt{\frac{\log 1/\delta}{2s}}, \tag{10}$$

where $\delta$ is an input parameter of the algorithm. Given the index of each arm at each round
$t$, the algorithm simply selects the arm with the smallest mean-variance index, i.e.,
$I_t = \arg\min_i B_{i,T_{i,t-1}}$.
We refer to this algorithm as the mean-variance lower-confidence bound (MV-LCB) algorithm.
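
The index rule can be turned into a short simulation. The Python sketch below follows the lower-confidence bound $\widehat{\mathrm{MV}}_{i,s} - (5 + \rho)\sqrt{\log(1/\delta)/(2s)}$ and is only illustrative: the sampler interface (`arms` as a list of zero-argument functions), the running-sum bookkeeping, and the initial round in which every arm is pulled once (needed so that every index is defined) are assumptions not specified in the pseudo-code.

```python
import math
import random


def mv_lcb(arms, n, rho, delta):
    """Run an MV-LCB style strategy for n rounds.

    arms: list of zero-argument callables returning a reward in [0, 1] (assumed interface).
    Returns the list of collected rewards and the number of pulls per arm.
    """
    K = len(arms)
    count = [0] * K       # number of pulls per arm
    total = [0.0] * K     # running sum of rewards per arm
    total_sq = [0.0] * K  # running sum of squared rewards per arm

    def index(i):
        s = count[i]
        mean = total[i] / s
        var = max(total_sq[i] / s - mean ** 2, 0.0)  # guard against rounding below zero
        width = (5 + rho) * math.sqrt(math.log(1 / delta) / (2 * s))
        return (var - rho * mean) - width

    rewards = []
    for t in range(n):
        # Assumption: the first K rounds pull each arm once.
        i = t if t < K else min(range(K), key=index)
        x = arms[i]()
        count[i] += 1
        total[i] += x
        total_sq[i] += x * x
        rewards.append(x)
    return rewards, count


# Hypothetical instance with rho = 0: a low-variance arm and a Bernoulli arm.
rng = random.Random(0)
arms = [lambda: rng.uniform(0.4, 0.6), lambda: float(rng.random() < 0.6)]
n = 2000
rewards, pulls = mv_lcb(arms, n, rho=0.0, delta=1.0 / n ** 2)
print(pulls)
```

Because the confidence width carries the constant $(5 + \rho)$, the suboptimal arm can still receive a substantial share of the pulls at moderate horizons; the choice $\delta = 1/n^2$ in the usage example mirrors the tuning used for the bound in expectation.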

Remark 1. We notice that the algorithm reduces to UCB1 whenever $\rho \to \infty$. This is coherent
with the fact that for $\rho \to \infty$ the mean-variance problem reduces to the maximization of
the cumulative reward, for which UCB1 is already known to be nearly-optimal. On the other hand, for
$\rho = 0$, which leads to the problem of cumulative reward variance minimization, the algorithm
plays according to a lower-confidence-bound on the variances.

Remark 2. The MV-LCB algorithm is parameterized by a parameter $\delta$ which defines the
confidence level of the bounds employed in the definition of the index (10). In Theorem 1 we show
how to optimize the parameter when the horizon $n$ is known in advance. On the other hand, if $n$
is not known, it is possible to design an anytime version of MV-LCB by defining a non-decreasing
exploration sequence $(\varepsilon_t)_t$ instead of the term $\log 1/\delta$.



3.2 Theoretical Analysis 

In this section we report the analysis of the regret $\widetilde{\mathcal{R}}_n(\mathcal{A})$ of
MV-LCB (Fig. 1). As highlighted in eq. 7, it is enough to analyze the number of pulls for each of
the arms to recover a bound on the regret. The proofs (reported in the appendix) are mostly based
on similar arguments to the proof of UCB.

We derive the following regret bound in high probability and expectation.

Theorem 1. Let the optimal arm $i^*$ be unique and $b = 2(5 + \rho)$, the MV-LCB algorithm achieves
a pseudo-regret bounded as

[bound on $\widetilde{\mathcal{R}}_n(\mathcal{A})$ garbled in the source]

with probability at least $1 - 6nK\delta$. Similarly, if MV-LCB is run with $\delta = 1/n^2$ then

[bound on $\mathbb{E}[\widetilde{\mathcal{R}}_n(\mathcal{A})]$ garbled in the source]

Remark 1 (the bound). Let $\Delta_{\min} = \min_{i \ne i^*} \Delta_i$ and
$\Gamma_{\max} = \max_i |\Gamma_i|$, then a rough simplification of the previous bound leads to

$$\widetilde{\mathcal{R}}_n(\mathcal{A}) \le O\left(\frac{K}{\Delta_{\min}}\frac{\log n}{n} + K^2\frac{\Gamma_{\max}^2}{\Delta_{\min}^4}\frac{\log^2 n}{n}\right).$$

First we notice that the regret decreases as $O(\log^2 n / n)$, implying that MV-LCB is a consistent
algorithm. As already highlighted in Definition 2, the regret is mainly composed of two terms. The
first term is due to the difference in the mean-variance of the best arm and the arms pulled by the
algorithm, while the second term denotes the additional variance introduced by the exploration risk
of pulling arms with different means. In particular, it is interesting to note that this additional
term depends on the squared difference in the means of the arms $\Gamma_{i,j}^2$. Thus, if all the
arms have the same mean, this term would be zero.

Remark 2 (worst-case analysis). We can further study the result of Theorem 1 by considering the
worst-case performance of MV-LCB, that is the performance when the distributions of the arms are
chosen so as to maximize the regret. In order to illustrate our argument we consider the simple case
of $K = 2$ arms, $\rho = 0$ (variance minimization), $\mu_1 \ne \mu_2$, and
$\sigma_1^2 = \sigma_2^2 = 0$ (deterministic arms).⁵ In this case we have a variance gap
$\Delta = 0$ and $\Gamma^2 > 0$. According to the definition of MV-LCB, the index $B_{i,s}$ would
simply reduce to $B_{i,s} = \sqrt{\log(1/\delta)/s}$, thus forcing the algorithm to pull both arms
uniformly (i.e., $T_{1,n} = T_{2,n} = n/2$ up to rounding effects). Since the arms have the same
variance, there is no direct regret in pulling either one or the other. Nonetheless, the algorithm
has an additional variance due to the difference in the samples drawn from distributions with
different means. In this case, the algorithm suffers a constant (true) regret

$$\mathcal{R}_n(\text{MV-LCB}) = 0 + \frac{T_{1,n}T_{2,n}}{n^2}\Gamma^2 = \frac{1}{4}\Gamma^2,$$

independent from the number of rounds $n$. This argument can be generalized to multiple arms and
$\rho \ne 0$, since it is always possible to design an environment (i.e., a set of distributions)
such that $\Delta_{\min} = 0$ and $\Gamma_{\max} \ne 0$.⁶ This result is not surprising. In fact,
two arms with the same mean-variance are likely to produce similar observations, thus leading
MV-LCB to pull the two arms repeatedly over time, since the algorithm is designed to try to
discriminate between similar arms. Although this behavior does not suffer from any regret in
pulling the "suboptimal" arm (the two arms are equivalent), it does introduce an additional
variance, due to the difference in the means of the arms ($\Gamma \ne 0$), which finally leads to a
regret the algorithm is not "aware" of. This argument suggests that, for any $n$, it is always
possible to design an environment for which MV-LCB has a constant regret. This is particularly
interesting since it reveals a huge gap between the mean-variance problem and the standard expected
regret minimization problem and will be further investigated in the numerical simulations presented
in Section 5. In fact, in the latter case, UCB is known to have a worst-case regret per round of
$\Omega(1/\sqrt{n})$ [3], while in the worst case, MV-LCB suffers a constant regret. In the next
section we introduce a simple algorithm able to deal with this problem and achieve a vanishing
worst-case regret.
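
The constant regret in this two-arm example can be reproduced with a few lines of Python. The sketch assumes $\rho = 0$, two deterministic arms, and a strategy that pulls them in alternation (so each arm is pulled $n/2$ times for even $n$); the function names are hypothetical.

```python
def run_variance(rewards):
    """(1/n)-normalized variance of a reward sequence."""
    m = sum(rewards) / len(rewards)
    return sum((x - m) ** 2 for x in rewards) / len(rewards)


def uniform_pull_regret(mu1, mu2, n):
    """True regret for rho = 0 when two deterministic arms are pulled in alternation."""
    rewards = [mu1 if t % 2 == 0 else mu2 for t in range(n)]
    # Either arm alone yields a constant sequence with zero variance,
    # so the regret equals the variance of the mixed sequence.
    return run_variance(rewards)


for n in (10, 1000, 10000):
    print(n, uniform_pull_regret(0.2, 0.8, n))
```

For every even horizon the value stays at $\Gamma^2/4 = (0.8 - 0.2)^2/4 = 0.09$ (up to floating-point error), so the regret does not shrink as $n$ grows.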



5 Note that in this case (i.e., $\Delta = 0$), Theorem 1 does not hold, since the optimal arm is not unique.
6 Notice that this is always possible for a large majority of distributions for which the mean and variance are
independent or mildly correlated.

[Figure 2, panel "Worst Case Regret vs. n": curves for MV-LCB and ExpExp.]

Figure 2: Regret of MV-LCB and ExpExp in different scenarios.

4 The Exploration-Exploitation Algorithm 

The ExpExp algorithm divides the time horizon $n$ into two distinct phases of length $\tau$ and
$n - \tau$ respectively. During the first phase all the arms are explored uniformly, thus
collecting $\tau/K$ samples each.⁷ Once the exploration phase is over, the mean-variance of each
arm is computed and the arm with the smallest estimated mean-variance
$\widehat{\mathrm{MV}}_{i,\tau/K}$ is repeatedly pulled until the end.
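
The two phases translate almost literally into code. The Python sketch below is an illustration under stated assumptions: `arms` is a list of zero-argument reward samplers, the exploration phase is a round-robin over the arms with $\lfloor \tau/K \rfloor$ pulls each (rounding down is a choice made here), and ties in the estimated mean-variance are broken by the lowest index.

```python
import random


def empirical_mv(samples, rho):
    """Empirical mean-variance with (1/t)-normalized variance."""
    t = len(samples)
    mean = sum(samples) / t
    var = sum((x - mean) ** 2 for x in samples) / t
    return var - rho * mean


def exp_exp(arms, n, rho, tau):
    """Explore all arms uniformly for about tau rounds, then commit to one arm."""
    K = len(arms)
    per_arm = tau // K  # assumption: round tau / K down
    if per_arm < 1 or per_arm * K > n:
        raise ValueError("tau must allow at least one pull per arm and at most n pulls")
    samples = [[] for _ in range(K)]
    rewards = []
    # Phase 1: uniform (round-robin) exploration.
    for _ in range(per_arm):
        for i in range(K):
            x = arms[i]()
            samples[i].append(x)
            rewards.append(x)
    # Phase 2: commit to the arm with the smallest estimated mean-variance.
    chosen = min(range(K), key=lambda i: empirical_mv(samples[i], rho))
    while len(rewards) < n:
        rewards.append(arms[chosen]())
    return rewards, chosen


# Hypothetical instance with rho = 0: a low-variance arm and a Bernoulli arm.
rng = random.Random(1)
arms = [lambda: rng.uniform(0.4, 0.6), lambda: float(rng.random() < 0.6)]
rewards, chosen = exp_exp(arms, n=3000, rho=0.0, tau=200)
print(chosen, len(rewards))
```

The length of the exploration phase is left as a free parameter `tau` here; how to set it as a function of $n$ and $K$ is the subject of the trade-off discussed next.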

The MV-LCB is specifically designed to minimize the probability of pulling the wrong arms, so
whenever there are two equivalent arms (i.e., arms with the same mean-variance), the algorithm
tends to pull them the same number of times, at the cost of potentially introducing an additional
variance which might result in a constant regret. On the other hand, ExpExp stops exploring the
arms after $\tau$ rounds and then elicits one arm as the best and keeps pulling it for the
remaining $n - \tau$ rounds. Intuitively, the parameter $\tau$ should be tuned so as to meet
different requirements. The first part of the regret (i.e., the regret coming from pulling the
suboptimal arms) suggests that the exploration phase $\tau$ should be long enough for the algorithm
to select the empirically best arm $\hat{i}^*$ at $\tau$ equivalent to the actual optimal arm $i^*$
with high probability and, at the same time, as short as possible to reduce the number of times
the suboptimal arms are explored. On the other hand, the second part of the regret (i.e., the
variance of pulling arms with different means) is minimized by taking $\tau$ as small as possible
(e.g., $\tau = 0$ would guarantee a zero regret). The following theorem illustrates the optimal
trade-off between these contrasting needs.

Theorem 2. Let ExpExp be run with $\tau = K(n/14)^{2/3}$, then for any choice of distributions
$\{\nu_i\}$ the expected regret is $\mathbb{E}[\widetilde{\mathcal{R}}_n(\mathcal{A})] \le 2\frac{K}{n^{1/3}}$.

Remark 1 (the bound). We first notice that this bound suggests that ExpExp performs worse
than MV-LCB on easy problems. In fact, Theorem 1 demonstrates that MV-LCB has a regret
decreasing as $O(K \log(n)/n)$ whenever the gaps $\Delta$ are not small compared to $n$, while in
the remarks of Theorem 1 we highlighted the fact that for any value of $n$, it is always possible
to design an environment which leads MV-LCB to suffer a constant regret. On the other hand, the
previous bound for ExpExp is distribution independent and indicates the regret is still a
decreasing function of $n$ even in the worst case. This opens the question whether it is possible
to design an algorithm which works as well as MV-LCB on easy problems and as robustly as ExpExp on
difficult problems.

Remark 2 (exploration phase). The previous result can be improved by changing the exploration
strategy used in the first $\tau$ rounds. Instead of a pure uniform exploration of all the arms, we
could adopt a best-arm identification algorithm such as Successive Reject or UCB-E, which maximize
the probability of returning the best arm given a fixed budget of rounds $\tau$ (see e.g., [?]).

5 Numerical Simulations 

In this section we report numerical simulations aimed at validating the main theoretical findings
reported in the previous sections. In the following graphs we study the true regret
$\mathcal{R}_n(\mathcal{A})$ averaged over 500 runs. We first consider the variance minimization
problem ($\rho = 0$) with $K = 2$ Gaussian arms set to $\mu_1 = 1.0$, $\mu_2 = 0.5$,
$\sigma_1^2 = 0.05$, and $\sigma_2^2 = 0.25$ and run MV-LCB.⁸ In Figure 2 we report the true regret
$\mathcal{R}_n$ (as in the original definition in eq. 4) and its two components
$\mathcal{R}_n^{\hat\Delta}$ and $\mathcal{R}_n^{\hat\Gamma}$ (these two values are defined as in
eq. 9 with $\hat\Delta$ and $\hat\Gamma$ replacing $\Delta$ and $\Gamma$). As expected (see e.g.,
Theorem 1), the regret is characterized by the regret realized from pulling suboptimal arms and
arms with different means (Exploration Risk) and tends to zero as $n$ increases. Indeed, if we
considered two distributions with equal means ($\mu_1 = \mu_2$), the average regret would coincide
with $\mathcal{R}_n^{\hat\Delta}$. Furthermore, as shown in Theorem 1, the two regret terms
decrease with the same rate $O(\log n / n)$.

7 In the definition and in the following analysis we ignore rounding effects.

A detailed analysis of the impact of $\Delta$ and $\Gamma$ on the performance of MV-LCB is reported
in the appendix. Here we only compare the worst-case performance of MV-LCB to ExpExp (see Figure 2).
In order to have a fair comparison, for any value of $n$ and for each of the two algorithms, we
select the pair $(\Delta, \Gamma)$ which corresponds to the largest regret (we search in a grid of
values with $\mu_1 = 1.5$, $\mu_2 \in [0.4; 1.5]$, $\sigma_1^2 \in [0.0; 0.25]$, and
$\sigma_2^2 = 0.25$, so that $\Delta \in [0.0; 0.25]$ and $\Gamma \in [0.0; 1.1]$). As discussed in
Section 4, while the worst-case regret of ExpExp keeps decreasing over $n$, it is always possible
to find a problem for which the regret of MV-LCB stabilizes to a constant. For numerical results
with multiple values of $\rho$ and 15 arms, please see Appendix D.

6 Discussion 

In this paper we evaluate the risk of an algorithm in terms of the variability of the sequences of 
samples that it actually generates. Although this notion might resemble other analyses of
UCB-based algorithms (see e.g., the high-probability analysis in [5]), it captures different features of the
learning algorithm. Whenever a bandit algorithm is run over n rounds, its behavior, combined with 
the arms' distributions, generates a probability distribution over sequences of n rewards. While the 
quality of this sequence is usually defined by its cumulative sum (or average), here we say that a 
sequence of rewards is good if it displays a good trade-off between its (empirical) mean and variance. 
It is important to notice that this notion of risk-return tradeoff does not coincide with the variance of 
the algorithm over multiple runs. 

Let us consider a simple case with two arms that deterministically generate 0s and 1s respectively,
and two different algorithms. Algorithm $\mathcal{A}_1$ pulls the arms in a fixed sequence at each
run (e.g., arm 1, arm 2, arm 1, arm 2, and so on), so that each arm is always pulled $n/2$ times.
Algorithm $\mathcal{A}_2$ chooses one arm uniformly at random at the beginning of the run and
repeatedly pulls this arm for $n$ rounds. Algorithm $\mathcal{A}_1$ generates sequences such as
010101... which have high variability within each run, incurs a high regret (e.g., if $\rho = 0$),
but has no variance over multiple runs because it always generates the same sequence. On the other
hand, $\mathcal{A}_2$ has no variability in each run, since it generates sequences with only 0s or
only 1s, suffers no regret in the case of variance minimization, but has high variance over
multiple runs since the two completely different sequences are generated with equal probability.
This simple example demonstrates that an algorithm with a very small standard regret w.r.t. the
cumulative reward (e.g., $\mathcal{A}_1$), might result in a very high variability in a single run
of the algorithm, while an algorithm with small mean-variance regret (e.g., $\mathcal{A}_2$) could
have a high variance over multiple runs.
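
The contrast between the two notions of variability can be made concrete with a small simulation. The Python sketch below encodes the two algorithms of the example (the function names, the seed, and the number of runs are arbitrary choices) and compares the variance within a run to the variance of the average reward across runs.

```python
import random


def mean(xs):
    return sum(xs) / len(xs)


def variance(xs):
    m = mean(xs)
    return sum((x - m) ** 2 for x in xs) / len(xs)


def run_a1(n):
    """Fixed alternation between the arm returning 0 and the arm returning 1."""
    return [t % 2 for t in range(n)]


def run_a2(n, rng):
    """Pick one arm uniformly at random, then pull it for all n rounds."""
    reward = rng.choice([0, 1])
    return [reward] * n


n, runs = 100, 1000
rng = random.Random(0)
a1_runs = [run_a1(n) for _ in range(runs)]
a2_runs = [run_a2(n, rng) for _ in range(runs)]

within_a1 = max(variance(r) for r in a1_runs)     # variability inside a single run
across_a1 = variance([mean(r) for r in a1_runs])  # variability of the run average
within_a2 = max(variance(r) for r in a2_runs)
across_a2 = variance([mean(r) for r in a2_runs])
print(within_a1, across_a1, within_a2, across_a2)
```

The first algorithm has within-run variance 0.25 and no variance across runs, while the second has no within-run variance and an across-run variance close to 0.25 (the exact value depends on the random draws).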

7 Conclusions 

The majority of multi-armed bandit literature focuses on the problem of minimizing the regret w.r.t.
the arm with the highest return in expectation. We study the notion of risk associated to the
variance over multiple runs and risk of variability associated to a single run of an algorithm. The
latter case highlights an interesting effect on the regret due to the need to estimate variability
within a single sequence of finite random samples before making a risk-averse decision. Further,
controlling the variance risk over multiple runs does not necessarily control the risk of
variability over a single run. In this paper, we introduced a novel multi-armed bandit setting
where the objective is to perform as well as the arm with the best risk-return trade-off. In
particular, we relied on the mean-variance model introduced in [10] to measure the performance of
the arms and define the regret of a learning algorithm. We proposed two novel algorithms to solve
the mean-variance bandit problem and we reported their corresponding theoretical analysis. While
MV-LCB shows a small regret of order $O(\log n / n)$ on "easy" problems (i.e., where the
mean-variance gaps $\Delta$ are big w.r.t. $n$), we showed that it has a constant worst-case
regret. On the other hand, we proved that ExpExp has a vanishing worst-case regret at the cost of
worse performance on "easy" problems. To the best of our knowledge this is the first work
introducing risk-aversion in the multi-armed bandit setting and it opens a series of interesting
questions.

8 Notice that although in the paper we assumed the distributions to be bounded in [0, 1] all the results can be
extended to sub-Gaussian distributions.

Lower bound. In this paper we introduced two algorithms, MV-LCB and ExpExp. As discussed in
the remarks of Theorem 1 and Theorem 2, MV-LCB has a regret of order $O(\sqrt{K/n})$ on easy
problems and $O(1)$ on difficult problems, while ExpExp achieves the same regret $O(K/n^{1/3})$
over all problems. The primary open question is whether $O(K/n^{1/3})$ is actually the best
possible achievable rate (in the worst-case) for this problem or a better rate is possible. This
question is of particular interest since the standard reward expectation maximization problem has a
known lower-bound of $\Omega(\sqrt{1/n})$, and a minimax rate of $\Omega(1/n^{1/3})$ for the
mean-variance problem would imply that the risk-averse bandit problem is intrinsically more
difficult than standard bandit problems.

Different measures of return-risk. Considering alternative notions of risk is a straightforward
extension to the previous setting. In fact, over the years the mean-variance model has often been
criticized. From the point of view of expected utility theory, the mean-variance model is only
justified under a Gaussianity assumption on the arm distributions. It also violates the monotonicity
condition due to the different orders of the mean and variance and is not a coherent measure of
risk [2]. Furthermore, the variance is a symmetric measure of risk, while it is often the case that
only one-sided deviations from the mean are undesirable (e.g., in finance only losses w.r.t. the
expected return are considered as a risk, while any positive deviation is not considered as a real
risk). A popular replacement for the mean-variance is to use the $\alpha$ value-at-risk (i.e., the
quantile) to measure the risk of a random variable. The main challenge in this case is the
estimation of the value-at-risk for each arm. In fact, while the cumulative distribution of a
random variable can be reliably estimated (see e.g., [?]), estimating the quantile might be more
difficult.

In [2], axiomatic rules are listed that define coherent measures of risk. Although value-at-risk violates
these rules, conditional value-at-risk (otherwise known as average value-at-risk, tail value-at-risk,
expected shortfall, and lower tail risk) satisfies them and is thus a coherent measure of risk. One can
easily imagine a lower confidence bound algorithm based on Q in the same composition as MV- 
LCB which replaces the variance by the conditional value at risk. 
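
As a rough illustration of the two risk measures discussed above, the following sketch estimates the value-at-risk and the conditional value-at-risk of an arm from its observed rewards. The function names, the lower-tail convention (risk is measured on the worst alpha-fraction of rewards), and the simple order-statistic estimator are assumptions made for this example; they are not taken from the paper.

```python
import math


def empirical_var(rewards, alpha):
    """Empirical alpha-quantile of the rewards (lower tail).

    Assumed convention: the k-th smallest reward with k = ceil(alpha * n).
    """
    ordered = sorted(rewards)
    k = max(1, math.ceil(alpha * len(ordered)))
    return ordered[k - 1]


def empirical_cvar(rewards, alpha):
    """Average of the k = ceil(alpha * n) smallest rewards."""
    ordered = sorted(rewards)
    k = max(1, math.ceil(alpha * len(ordered)))
    return sum(ordered[:k]) / k


rewards = [float(x) for x in range(1, 21)]  # toy sample: 1.0, 2.0, ..., 20.0
print(empirical_var(rewards, 0.25))
print(empirical_cvar(rewards, 0.25))
```

Since the conditional value-at-risk averages the whole tail below the quantile, under this convention it is never larger than the value-at-risk computed on the same sample.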

The notion of optimality in the risk sensitive setting also depends on the selection of a single-period 
or multi-period risk evaluation. While the single-period risk of an arm is simply the risk of its 
distribution, in a multi-period evaluation we consider the risk of the sum of rewards obtained by 
repeatedly pulling the same arm over n rounds. Unlike the variance, for which the variance of a 
sum of n independent realizations of the same random variable is simply n times its variance, for 
other measures of risk (e.g., a value-at-risk) this is not necessarily the case. As a result, an arm 
with the smallest single-period risk might not be the optimal choice over a horizon of n rounds.
Therefore, the performance of a learning algorithm should be compared to the smallest risk that can 
be achieved by any sequence of arms over n rounds, thus requiring a new definition of regret. 
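
The difference between single-period and multi-period risk can be sketched numerically. The example below assumes Gaussian arms (an assumption of this illustration only) and uses the lower 5% quantile of the cumulative reward as the value-at-risk; the arm parameters are hypothetical. For a Gaussian arm the quantile of the sum of n rewards is $n\mu + \sqrt{n}\,\sigma z_\alpha$, which is not n times the single-period quantile.

```python
from math import sqrt
from statistics import NormalDist


def quantile_of_sum(mu, sigma, n, alpha=0.05):
    """alpha-quantile of the sum of n i.i.d. N(mu, sigma^2) rewards."""
    return NormalDist(n * mu, sqrt(n) * sigma).inv_cdf(alpha)


arm_a = (0.0, 0.1)  # (mean, standard deviation): low return, low risk
arm_b = (0.5, 1.0)  # higher return, higher risk

for n in (1, 100):
    q_a = quantile_of_sum(*arm_a, n)
    q_b = quantile_of_sum(*arm_b, n)
    # A larger lower quantile means smaller losses in the worst 5% of outcomes.
    preferred = "A" if q_a > q_b else "B"
    print(n, round(q_a, 3), round(q_b, 3), preferred)
```

In this toy setting arm A is preferred for a single round, while arm B is preferred over 100 rounds, which is the kind of ranking change that motivates a new definition of regret.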

Linear bandits. In linear bandits, each arm is characterized by a marginal distribution with expected
value $\mu_i$ and a covariance matrix $C$. At each step the learner chooses a combination of arms and
observes the corresponding combined reward. In this case, the best combination is obtained by
solving the mean-variance quadratic program $\min_x (x^\top C x - \rho x^\top \mu)$, where $x$ is usually a point in
the $K$-dimensional simplex (e.g., in finance $x$ is in the simplex when no short-selling is allowed).
Similar to the multi-arm case, the objective is to define an algorithm able to achieve a mean-variance
as small as the best point in the simplex over n rounds.
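
When $C$ and $\mu$ are known, the quadratic program above can be solved numerically. The sketch below uses projected gradient descent with a sort-based Euclidean projection onto the simplex; this solver, the function names, and the toy inputs are choices made for this illustration and are not prescribed by the paper. It assumes $C$ is symmetric, positive semi-definite, and non-zero.

```python
def project_to_simplex(v):
    """Euclidean projection of v onto {x : x >= 0, sum(x) = 1}."""
    ordered = sorted(v, reverse=True)
    cumulative = 0.0
    theta = 0.0
    for k, u_k in enumerate(ordered, start=1):
        cumulative += u_k
        candidate = (cumulative - 1.0) / k
        if u_k - candidate > 0:
            theta = candidate  # keep the threshold of the largest valid k
    return [max(x - theta, 0.0) for x in v]


def mean_variance_weights(C, mu, rho, iterations=5000):
    """Approximately minimize x^T C x - rho * x^T mu over the simplex."""
    K = len(mu)
    # The gradient 2 C x - rho mu is Lipschitz with constant at most
    # 2 * (maximum absolute row sum of C); use the step size 1 / L.
    lipschitz = 2.0 * max(sum(abs(c) for c in row) for row in C)
    step = 1.0 / lipschitz
    x = [1.0 / K] * K
    for _ in range(iterations):
        grad = [2.0 * sum(C[i][j] * x[j] for j in range(K)) - rho * mu[i]
                for i in range(K)]
        x = project_to_simplex([x[i] - step * grad[i] for i in range(K)])
    return x


C = [[0.04, 0.0], [0.0, 0.01]]  # hypothetical covariance matrix
print(mean_variance_weights(C, [0.5, 0.5], rho=0.0))
print(mean_variance_weights(C, [1.0, 0.0], rho=1.0))
```

With $\rho = 0$ the solution spreads weight in inverse proportion to the variances, while a large enough $\rho$ moves all the weight to the arm with the highest mean.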

Simple regret. Finally, an interesting related problem is the simple regret setting where the learner
is allowed to explore over n rounds and only suffers a regret defined on the solution returned at
the end. It is known that it is possible to design an algorithm able to effectively estimate the means of
the arms and finally return the best arm with high probability. In the risk-return setting, the objective
would be to return the arm with the best risk-return tradeoff.
A The Regret
A.1 The True Regret 

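We recall the definition of the (empirical) regret as

$$\mathcal{R}_n(\mathcal{A}) = \widehat{\mathrm{MV}}_n(\mathcal{A}) - \widehat{\mathrm{MV}}_{i^*,n}.$$

Given the definitions reported in the main paper, we first elaborate on the two mean terms in the
regret as

$$\hat\mu_{i^*,n} = \frac{1}{n}\sum_{t=1}^{n} X_{i^*,t} = \frac{1}{n}\sum_{i=1}^{K}\sum_{t=1}^{T_{i,n}} X_{i^*,t} = \frac{1}{n}\sum_{i=1}^{K} T_{i,n}\,\hat\mu_{i^*,T_{i,n}},$$

and

$$\hat\mu_n(\mathcal{A}) = \frac{1}{n}\sum_{i=1}^{K}\sum_{t=1}^{T_{i,n}} X_{i,t} = \frac{1}{n}\sum_{i=1}^{K} T_{i,n}\,\hat\mu_{i,T_{i,n}}.$$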

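Similarly, the two variance terms can be written as

$$\begin{aligned}
\hat\sigma_n^2(\mathcal{A}) &= \frac{1}{n}\sum_{i=1}^{K}\sum_{t=1}^{T_{i,n}} \big(X_{i,t} - \hat\mu_n(\mathcal{A})\big)^2 \\
&= \frac{1}{n}\sum_{i=1}^{K}\sum_{t=1}^{T_{i,n}} \big(X_{i,t} - \hat\mu_{i,T_{i,n}}\big)^2 + \frac{1}{n}\sum_{i=1}^{K}\sum_{t=1}^{T_{i,n}} \big(\hat\mu_{i,T_{i,n}} - \hat\mu_n(\mathcal{A})\big)^2 + \frac{2}{n}\sum_{i=1}^{K}\sum_{t=1}^{T_{i,n}} \big(X_{i,t} - \hat\mu_{i,T_{i,n}}\big)\big(\hat\mu_{i,T_{i,n}} - \hat\mu_n(\mathcal{A})\big) \\
&= \frac{1}{n}\sum_{i=1}^{K} T_{i,n}\,\hat\sigma^2_{i,T_{i,n}} + \frac{1}{n}\sum_{i=1}^{K} T_{i,n}\big(\hat\mu_{i,T_{i,n}} - \hat\mu_n(\mathcal{A})\big)^2 + 0,
\end{aligned}$$

and

$$\begin{aligned}
\hat\sigma^2_{i^*,n} &= \frac{1}{n}\sum_{t=1}^{n} \big(X_{i^*,t} - \hat\mu_{i^*,n}\big)^2 \\
&= \frac{1}{n}\sum_{i=1}^{K}\sum_{t=1}^{T_{i,n}} \big(X_{i^*,t} - \hat\mu_{i^*,T_{i,n}}\big)^2 + \frac{1}{n}\sum_{i=1}^{K}\sum_{t=1}^{T_{i,n}} \big(\hat\mu_{i^*,T_{i,n}} - \hat\mu_{i^*,n}\big)^2 + \frac{2}{n}\sum_{i=1}^{K}\sum_{t=1}^{T_{i,n}} \big(X_{i^*,t} - \hat\mu_{i^*,T_{i,n}}\big)\big(\hat\mu_{i^*,T_{i,n}} - \hat\mu_{i^*,n}\big) \\
&= \frac{1}{n}\sum_{i=1}^{K} T_{i,n}\,\hat\sigma^2_{i^*,T_{i,n}} + \frac{1}{n}\sum_{i=1}^{K} T_{i,n}\big(\hat\mu_{i^*,T_{i,n}} - \hat\mu_{i^*,n}\big)^2.
\end{aligned}$$
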

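Putting together these terms, we obtain the regret (see eq. 4)

$$\begin{aligned}
\mathcal{R}_n(\mathcal{A}) = {} & \frac{1}{n}\sum_{i \neq i^*} T_{i,n}\Big[\big(\hat\sigma^2_{i,T_{i,n}} - \hat\sigma^2_{i^*,T_{i,n}}\big) - \rho\big(\hat\mu_{i,T_{i,n}} - \hat\mu_{i^*,T_{i,n}}\big)\Big] \\
& + \frac{1}{n}\sum_{i=1}^{K} T_{i,n}\big(\hat\mu_{i,T_{i,n}} - \hat\mu_n(\mathcal{A})\big)^2 - \frac{1}{n}\sum_{i=1}^{K} T_{i,n}\big(\hat\mu_{i^*,T_{i,n}} - \hat\mu_{i^*,n}\big)^2.
\end{aligned}$$
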
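If we further elaborate the second term, we obtain

$$\begin{aligned}
\frac{1}{n}\sum_{i=1}^{K} T_{i,n}\big(\hat\mu_{i,T_{i,n}} - \hat\mu_n(\mathcal{A})\big)^2
&= \frac{1}{n}\sum_{i=1}^{K} T_{i,n}\Big(\hat\mu_{i,T_{i,n}} - \frac{1}{n}\sum_{j=1}^{K} T_{j,n}\,\hat\mu_{j,T_{j,n}}\Big)^2 \\
&= \frac{1}{n}\sum_{i=1}^{K} T_{i,n}\Big(\sum_{j=1}^{K} \frac{T_{j,n}}{n}\big(\hat\mu_{i,T_{i,n}} - \hat\mu_{j,T_{j,n}}\big)\Big)^2 \\
&\le \frac{1}{n}\sum_{i=1}^{K} T_{i,n}\sum_{j=1}^{K} \frac{T_{j,n}}{n}\big(\hat\mu_{i,T_{i,n}} - \hat\mu_{j,T_{j,n}}\big)^2 \\
&= \frac{1}{n^2}\sum_{i=1}^{K}\sum_{j \neq i} T_{i,n}T_{j,n}\big(\hat\mu_{i,T_{i,n}} - \hat\mu_{j,T_{j,n}}\big)^2,
\end{aligned}$$

where the inequality is Jensen's inequality applied with the weights $T_{j,n}/n$, which sum to one.
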



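Using the definitions $\hat\Delta_i = (\hat\sigma^2_{i,T_{i,n}} - \hat\sigma^2_{i^*,T_{i,n}}) - \rho(\hat\mu_{i,T_{i,n}} - \hat\mu_{i^*,T_{i,n}})$ and $\hat\Gamma^2_{i,j} = (\hat\mu_{i,T_{i,n}} - \hat\mu_{j,T_{j,n}})^2$
we finally obtain an upper bound on the regret of the form

$$\mathcal{R}_n(\mathcal{A}) \le \frac{1}{n}\sum_{i \neq i^*} T_{i,n}\,\hat\Delta_i + \frac{1}{n^2}\sum_{i=1}^{K}\sum_{j \neq i} T_{i,n}T_{j,n}\,\hat\Gamma^2_{i,j}.$$

In the following we refer to the two terms as $\mathcal{R}_n^{\hat\Delta}$ and $\mathcal{R}_n^{\hat\Gamma}$.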
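
The variance decomposition used in this section can be checked numerically. The sketch below assumes the $1/T$ normalization for empirical variances and that every arm has been pulled at least once; the helper names and the toy rewards are hypothetical. It verifies that the variance of all observed rewards equals the pull-weighted within-arm variance plus the pull-weighted squared distance between each arm's empirical mean and the overall mean.

```python
def mean(xs):
    return sum(xs) / len(xs)


def variance(xs):
    """Empirical variance with 1/T normalization (assumed convention)."""
    m = mean(xs)
    return sum((x - m) ** 2 for x in xs) / len(xs)


def decompose_variance(rewards_by_arm):
    """Return the within-arm and between-arm parts of the overall variance."""
    n = sum(len(r) for r in rewards_by_arm)
    overall_mean = sum(sum(r) for r in rewards_by_arm) / n
    within = sum(len(r) * variance(r) for r in rewards_by_arm) / n
    between = sum(len(r) * (mean(r) - overall_mean) ** 2
                  for r in rewards_by_arm) / n
    return within, between


rewards_by_arm = [[0.2, 0.4, 0.6], [0.9, 0.7], [0.1]]  # toy rewards per arm
all_rewards = [x for r in rewards_by_arm for x in r]
within, between = decompose_variance(rewards_by_arm)
print(variance(all_rewards), within + between)
```

The between-arm part is the additional regret term that appears even when every pulled arm has a small variance, since switching between arms with different means increases the variance of the sequence of rewards.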
A.2 The Pseudo-Regret 

Similar to what is done in the standard bandit problem, we can introduce a different notion of regret.
Starting from the last equation in the previous section, we define the pseudo-regret

$$\widetilde{\mathcal{R}}_n(\mathcal{A}) = \frac{1}{n}\sum_{i \neq i^*} T_{i,n}\,\Delta_i + \frac{2}{n^2}\sum_{i=1}^{K}\sum_{j \neq i} T_{i,n}T_{j,n}\,\Gamma^2_{i,j},$$

where the empirical values $\hat\Delta_i$ and $\hat\Gamma_{i,j}$ are substituted by their corresponding exact values.$^1$ In the
following we show that the true regret and the pseudo-regret differ by quantities that tend to zero with high
probability.
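
For concreteness, the pseudo-regret can be computed directly from the number of pulls and the true arm parameters. This is a minimal sketch assuming $\Delta_i = \mathrm{MV}_i - \mathrm{MV}_{i^*}$ with $\mathrm{MV}_i = \sigma_i^2 - \rho\mu_i$ and $\Gamma_{i,j} = \mu_i - \mu_j$, as in the main paper; the function name and the toy values are hypothetical.

```python
def pseudo_regret(pulls, means, variances, rho):
    """Pseudo-regret for given pull counts and true means and variances."""
    n = sum(pulls)
    K = len(pulls)
    mv = [variances[i] - rho * means[i] for i in range(K)]
    best = min(range(K), key=lambda i: mv[i])  # arm with the smallest mean-variance
    gap_term = sum(pulls[i] * (mv[i] - mv[best]) for i in range(K)) / n
    mean_term = 2.0 / n ** 2 * sum(
        pulls[i] * pulls[j] * (means[i] - means[j]) ** 2
        for i in range(K) for j in range(K) if j != i
    )
    return gap_term + mean_term


means, variances = [0.0, 1.0], [0.1, 0.2]
print(pseudo_regret([100, 0], means, variances, rho=0.0))  # only the best arm is pulled
print(pseudo_regret([50, 50], means, variances, rho=0.0))  # pulls split evenly
```

Pulling only the optimal arm gives zero pseudo-regret, while splitting the pulls pays both for the gap $\Delta_i$ and for the difference between the means.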

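Proof. (Lemma 1)

Recall that the true regret is upper-bounded as

$$\mathcal{R}_n(\mathcal{A}) \le \frac{1}{n}\sum_{i \neq i^*} T_{i,n}\,\hat\Delta_i + \frac{1}{n^2}\sum_{i=1}^{K}\sum_{j \neq i} T_{i,n}T_{j,n}\,\hat\Gamma^2_{i,j}.$$

We define a high-probability event in which the empirical values and the true values only differ for
small quantities

$$\mathcal{E} = \Big\{ \forall i = 1, \dots, K,\ \forall s = 1, \dots, n:\ |\hat\mu_{i,s} - \mu_i| \le \sqrt{\frac{\log 1/\delta}{2s}} \ \text{ and } \ |\hat\sigma^2_{i,s} - \sigma^2_i| \le 5\sqrt{\frac{\log 1/\delta}{2s}} \Big\}.$$

Using the Chernoff-Hoeffding inequality and a union bound over arms and rounds, we have that
$\mathbb{P}[\mathcal{E}^c] \le 6nK\delta$. Under this event we rewrite the empirical $\hat\Delta_i$ as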

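$$\begin{aligned}
\hat\Delta_i &= \Delta_i - (\sigma^2_i - \sigma^2_{i^*}) + \rho(\mu_i - \mu_{i^*}) + (\hat\sigma^2_{i,T_{i,n}} - \hat\sigma^2_{i^*,T_{i,n}}) - \rho(\hat\mu_{i,T_{i,n}} - \hat\mu_{i^*,T_{i,n}}) \\
&\le \Delta_i + 2(5 + \rho)\sqrt{\frac{\log 1/\delta}{2T_{i,n}}}.
\end{aligned}$$

Similarly, $\hat\Gamma_{i,j}$ is upper-bounded as

$$|\hat\Gamma_{i,j}| \le |\Gamma_{i,j}| + \sqrt{\frac{\log 1/\delta}{2T_{i,n}}} + \sqrt{\frac{\log 1/\delta}{2T_{j,n}}}.$$
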



$^1$Notice that the factor 2 in front of the second term is due to a rough upper bounding used in the proof of
Lemma 1.
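Thus the regret can be written as

$$\begin{aligned}
\mathcal{R}_n(\mathcal{A}) &\le \frac{1}{n}\sum_{i \neq i^*} T_{i,n}\Big(\Delta_i + 2(5 + \rho)\sqrt{\frac{\log 1/\delta}{2T_{i,n}}}\Big) + \frac{1}{n^2}\sum_{i=1}^{K}\sum_{j \neq i} T_{i,n}T_{j,n}\Big(|\Gamma_{i,j}| + \sqrt{\frac{\log 1/\delta}{2T_{i,n}}} + \sqrt{\frac{\log 1/\delta}{2T_{j,n}}}\Big)^2 \\
&\le \frac{1}{n}\sum_{i \neq i^*} T_{i,n}\,\Delta_i + \frac{5 + \rho}{n}\sum_{i \neq i^*} \sqrt{2T_{i,n}\log 1/\delta} + \frac{2}{n^2}\sum_{i=1}^{K}\sum_{j \neq i} T_{i,n}T_{j,n}\,\Gamma^2_{i,j} \\
&\quad + \frac{2\sqrt{2}}{n^2}\sum_{i=1}^{K}\sum_{j \neq i} T_{j,n}\log 1/\delta + \frac{2\sqrt{2}}{n^2}\sum_{i=1}^{K}\sum_{j \neq i} T_{i,n}\log 1/\delta \\
&\le \frac{1}{n}\sum_{i \neq i^*} T_{i,n}\,\Delta_i + \frac{2}{n^2}\sum_{i=1}^{K}\sum_{j \neq i} T_{i,n}T_{j,n}\,\Gamma^2_{i,j} + (5 + \rho)\sqrt{\frac{2K\log 1/\delta}{n}} + 4\sqrt{2}\,\frac{K\log 1/\delta}{n},
\end{aligned}$$

where in the last inequality we used Jensen's inequality for concave functions and rough upper
bounds on the other terms ($K - 1 < K$, $\sum_{i \neq i^*} T_{i,n} \le n$). By recalling the definition of $\widetilde{\mathcal{R}}_n(\mathcal{A})$ we
finally obtain

$$\mathcal{R}_n(\mathcal{A}) \le \widetilde{\mathcal{R}}_n(\mathcal{A}) + (5 + \rho)\sqrt{\frac{2K\log 1/\delta}{n}} + 4\sqrt{2}\,\frac{K\log 1/\delta}{n}$$

with probability $1 - 6nK\delta$. Thus we can conclude that any upper bound on the pseudo-regret $\widetilde{\mathcal{R}}_n(\mathcal{A})$
is a valid upper bound for the true regret $\mathcal{R}_n(\mathcal{A})$, up to a decreasing term of order $O(\sqrt{K/n})$.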

□ 

B MV-LCB Theoretical Analysis 

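In order to simplify the notation, in the following we use $b = 2(5 + \rho)$.

Proof. (Theorem 1)

We begin by defining a high-probability event $\mathcal{E}$ as

$$\mathcal{E} = \Big\{ \forall i = 1, \dots, K,\ \forall s = 1, \dots, n:\ |\hat\mu_{i,s} - \mu_i| \le \sqrt{\frac{\log 1/\delta}{2s}} \ \text{ and } \ |\hat\sigma^2_{i,s} - \sigma^2_i| \le 5\sqrt{\frac{\log 1/\delta}{2s}} \Big\}.$$

Using the Chernoff-Hoeffding inequality and a union bound over arms and rounds, we have that

$$\mathbb{P}[\mathcal{E}^c] \le 6nK\delta.$$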

We now use the definition of the algorithm. Consider any time t when arm $i \neq i^*$ is pulled
(i.e., $I_t = i$). By definition of the algorithm in Figure 1, $i$ is selected if its corresponding index
$B_{i,T_{i,t-1}}$ is smaller than that of any other arm, notably the best arm $i^*$. By recalling the definition of the
index and the empirical mean-variance at time t, we have

$$\hat\sigma^2_{i,T_{i,t-1}} - \rho\,\hat\mu_{i,T_{i,t-1}} - (5 + \rho)\sqrt{\frac{\log 1/\delta}{2T_{i,t-1}}} = B_{i,T_{i,t-1}} \le B_{i^*,T_{i^*,t-1}}.$$

Over all the possible realizations, we now focus on the realizations in $\mathcal{E}$. In this case, we can rewrite
the previous condition as

$$\sigma^2_i - \rho\mu_i - 2(5 + \rho)\sqrt{\frac{\log 1/\delta}{2T_{i,t-1}}} \le B_{i,T_{i,t-1}} \le B_{i^*,T_{i^*,t-1}} \le \sigma^2_{i^*} - \rho\mu_{i^*}.$$
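Let time t be the last time when arm $i$ is pulled until the final round n, then $T_{i,t-1} = T_{i,n} - 1$ and

$$T_{i,n} \le \frac{2(5 + \rho)^2}{\Delta_i^2}\log\frac{1}{\delta} + 1,$$

which suggests that the suboptimal arms are pulled only a few times with high probability. Plugging
the bound in the regret in eq. 8 leads to the final statement

$$\widetilde{\mathcal{R}}_n(\mathcal{A}) \le \frac{1}{n}\sum_{i \neq i^*} \frac{b^2\log 1/\delta}{\Delta_i} + \frac{1}{n}\sum_{i \neq i^*} \frac{4b^2\log 1/\delta}{\Delta_i^2}\,\Gamma^2_{i^*,i} + \frac{1}{n^2}\sum_{i \neq i^*}\sum_{j \neq i,\, j \neq i^*} \frac{2b^4(\log 1/\delta)^2}{\Delta_i^2\Delta_j^2}\,\Gamma^2_{i,j} + \frac{5K}{n}$$

with probability $1 - 6nK\delta$.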

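We now move from the previous high-probability bound to a bound in expectation. The pseudo-regret
is (roughly) bounded as $\widetilde{\mathcal{R}}_n(\mathcal{A}) \le 2 + \rho$ (by bounding $\Delta_i \le 1 + \rho$ and $\Gamma \le 1$), thus

$$\mathbb{E}[\widetilde{\mathcal{R}}_n(\mathcal{A})] = \mathbb{E}[\widetilde{\mathcal{R}}_n(\mathcal{A})\,\mathbb{I}\{\mathcal{E}\}] + \mathbb{E}[\widetilde{\mathcal{R}}_n(\mathcal{A})\,\mathbb{I}\{\mathcal{E}^c\}] \le \mathbb{E}[\widetilde{\mathcal{R}}_n(\mathcal{A})\,\mathbb{I}\{\mathcal{E}\}] + (2 + \rho)\,\mathbb{P}[\mathcal{E}^c].$$

By using the previous high-probability bound and recalling that $\mathbb{P}[\mathcal{E}^c] \le 6nK\delta$, we have

$$\begin{aligned}
\mathbb{E}[\widetilde{\mathcal{R}}_n(\mathcal{A})] \le {} & \frac{1}{n}\sum_{i \neq i^*} \frac{b^2\log 1/\delta}{\Delta_i} + \frac{1}{n}\sum_{i \neq i^*} \frac{4b^2\log 1/\delta}{\Delta_i^2}\,\Gamma^2_{i^*,i} + \frac{1}{n^2}\sum_{i \neq i^*}\sum_{j \neq i,\, j \neq i^*} \frac{2b^4(\log 1/\delta)^2}{\Delta_i^2\Delta_j^2}\,\Gamma^2_{i,j} \\
& + \frac{5K}{n} + (2 + \rho)\,6nK\delta.
\end{aligned}$$

The final statement of the theorem follows by tuning the parameter $\delta = 1/n^2$ so as to have a regret
bound decreasing with n. □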
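
The selection rule analyzed in this proof can be sketched as follows. This is a minimal simulation assuming the index $B_{i,s} = \hat\sigma^2_{i,s} - \rho\hat\mu_{i,s} - (5 + \rho)\sqrt{\log(1/\delta)/(2s)}$ and the rule that pulls the arm with the smallest index after one initial pull per arm; the function name, the uniform toy arms, and the seed are assumptions of this illustration, not the paper's experimental setup.

```python
import math
import random


def mv_lcb(arms, n, rho, delta, rng):
    """Run an MV-LCB-style loop and return the number of pulls of each arm.

    `arms` is a list of functions mapping a random generator to a reward in [0, 1].
    """
    K = len(arms)
    count = [0] * K
    total = [0.0] * K      # running sum of rewards per arm
    total_sq = [0.0] * K   # running sum of squared rewards per arm

    def index(i):
        s = count[i]
        mean = total[i] / s
        var = max(total_sq[i] / s - mean ** 2, 0.0)  # 1/s normalization
        width = (5 + rho) * math.sqrt(math.log(1 / delta) / (2 * s))
        return var - rho * mean - width

    for t in range(n):
        # Pull every arm once, then always the arm with the smallest index.
        i = t if t < K else min(range(K), key=index)
        x = arms[i](rng)
        count[i] += 1
        total[i] += x
        total_sq[i] += x * x
    return count


n = 2000
arms = [lambda rng: rng.uniform(0.4, 0.6),  # low variance, optimal for rho = 0
        lambda rng: rng.uniform(0.0, 1.0)]  # high variance
pulls = mv_lcb(arms, n, rho=0.0, delta=1 / n ** 2, rng=random.Random(0))
print(pulls)
```

For these toy arms the gap is small compared to the confidence width, so the bound on $T_{i,n}$ is larger than n and the run mainly illustrates the selection rule: the low-variance arm is pulled more often, but the high-variance arm is still explored extensively.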

While a high-probability bound for $\mathcal{R}_n$ can be immediately obtained from Lemma 1, the expectation
of $\mathcal{R}_n$ is reported in the next corollary.

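Proof. Since the mean-variance satisfies $-\rho \le \mathrm{MV} \le 1/4$, the regret is bounded by $-1/4 - \rho \le \mathcal{R}_n(\mathcal{A}) \le
1/4 + \rho$. Thus we have

$$\mathbb{E}[\mathcal{R}_n(\mathcal{A})] \le u\,\mathbb{P}[\mathcal{R}_n(\mathcal{A}) \le u] + \Big(\frac{1}{4} + \rho\Big)\mathbb{P}[\mathcal{R}_n(\mathcal{A}) > u].$$

By taking $u$ equal to the previous high-probability bound and recalling that $\mathbb{P}[\mathcal{E}^c] \le 6nK\delta$, we
have

$$\begin{aligned}
\mathbb{E}[\mathcal{R}_n(\mathcal{A})] \le {} & \frac{1}{n}\sum_{i \neq i^*} \frac{b^2\log 1/\delta}{\Delta_i} + \frac{1}{n}\sum_{i \neq i^*} \frac{4b^2\log 1/\delta}{\Delta_i^2}\,\Gamma^2_{i^*,i} + \frac{1}{n^2}\sum_{i \neq i^*}\sum_{j \neq i,\, j \neq i^*} \frac{2b^4(\log 1/\delta)^2}{\Delta_i^2\Delta_j^2}\,\Gamma^2_{i,j} \\
& + \frac{5K}{n} + (5 + \rho)\sqrt{\frac{2K\log 1/\delta}{n}} + 4\sqrt{2}\,\frac{K\log 1/\delta}{n} + \Big(\frac{1}{4} + \rho\Big)6nK\delta.
\end{aligned}$$

The final statement of the corollary follows by tuning the parameter $\delta = 1/n^2$ so as to have a regret
bound decreasing with n. □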

C Exp-Exp Theoretical Analysis 

During the exploitation phase the algorithm pulls the arm $\hat{i}^*$ with the smallest empirical variance
estimated during the exploration phase of length $\tau$. As a result, the number of pulls of each arm
is

$$T_{i,n} = \frac{\tau}{K} + (n - \tau)\,\mathbb{I}\{i = \hat{i}^*\} \qquad (11)$$
We analyze the two terms of the regret separately.



(a) 

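We notice that the only random variable in this formulation is the estimated best arm $\hat{i}^*$ at the end of the
exploration phase. We thus compute the expected value of $\widetilde{\mathcal{R}}_n^{\Delta}$.

$$\mathbb{E}[(a)] = \mathbb{P}[i = \hat{i}^*]\,\Delta_i \le \mathbb{P}\big[\forall j \neq i,\ \hat\sigma^2_{i,\tau/K} \le \hat\sigma^2_{j,\tau/K}\big]\,\Delta_i \le 2\Delta_i \exp\Big(-\frac{\tau}{K}\Delta_i^2\Big)$$

The second term in the regret can be bounded as follows.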

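$$\begin{aligned}
\frac{1}{n^2}\sum_{i=1}^{K}\sum_{j \neq i} T_{i,n}T_{j,n}\,\Gamma^2_{i,j}
&= \frac{1}{n^2}\sum_{i=1}^{K}\sum_{j \neq i} \Big(\frac{\tau}{K} + (n - \tau)\,\mathbb{I}\{i = \hat{i}^*\}\Big)\Big(\frac{\tau}{K} + (n - \tau)\,\mathbb{I}\{j = \hat{i}^*\}\Big)\Gamma^2_{i,j} \\
&= \frac{\tau^2}{n^2K^2}\sum_{i=1}^{K}\sum_{j \neq i} \Gamma^2_{i,j} + 2\,\frac{\tau(n - \tau)}{n^2K}\sum_{i=1}^{K}\sum_{j \neq i} \Gamma^2_{i,j}\,\mathbb{I}\{i = \hat{i}^*\} \\
&\le \frac{\tau^2}{n^2} + 2\,\frac{\tau(n - \tau)}{n^2} \le 2\,\frac{\tau}{n},
\end{aligned}$$

where the first inequality uses $\Gamma^2_{i,j} \le 1$.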

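Grouping all the terms, ExpExp has an expected regret bounded as

$$\mathbb{E}[\widetilde{\mathcal{R}}_n(\mathcal{A})] \le 2\,\frac{\tau}{n} + 2\sum_{i \neq i^*} \Delta_i \exp\Big(-\frac{\tau}{K}\Delta_i^2\Big).$$

We can now move to the worst-case analysis of the regret. Let $f(\Delta_i) = \Delta_i \exp\big(-\frac{\tau}{K}\Delta_i^2\big)$; the
"adversarial" choice of the gap is determined by maximizing the regret, which corresponds to

$$\begin{aligned}
f'(\Delta_i) &= \exp\Big(-\frac{\tau}{K}\Delta_i^2\Big) + \Delta_i\Big(-2\,\frac{\tau}{K}\Delta_i \exp\Big(-\frac{\tau}{K}\Delta_i^2\Big)\Big) \\
&= \Big(1 - 2\,\frac{\tau}{K}\Delta_i^2\Big)\exp\Big(-\frac{\tau}{K}\Delta_i^2\Big) = 0,
\end{aligned}$$

and leads to a worst-case choice for the gap of

$$\Delta_i = \sqrt{\frac{K}{2\tau}}.$$
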
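The worst-case regret is then

$$\mathbb{E}[\widetilde{\mathcal{R}}_n(\mathcal{A})] \le 2\,\frac{\tau}{n} + (K - 1)\sqrt{\frac{2K}{\tau}}\exp(-0.5) \le 2\,\frac{\tau}{n} + \frac{K^{3/2}}{\sqrt{\tau}}.$$

We can now choose the parameter $\tau$ minimizing the worst-case regret. Taking the derivative of the
regret bound w.r.t. $\tau$ we obtain

$$\frac{d\,\mathbb{E}[\widetilde{\mathcal{R}}_n(\mathcal{A})]}{d\tau} = \frac{2}{n} - \frac{1}{2}\Big(\frac{K}{\tau}\Big)^{3/2},$$

thus leading to the optimal parameter $\tau = (n/4)^{2/3}K$. The final regret is thus bounded as

$$\mathbb{E}[\widetilde{\mathcal{R}}_n(\mathcal{A})] \le 3\,\frac{K}{n^{1/3}}.$$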
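
The tuning of the exploration length can be checked numerically. The sketch below evaluates the worst-case bound $2\tau/n + K^{3/2}/\sqrt{\tau}$ at $\tau = K(n/4)^{2/3}$ and at nearby values; the function names and the values of n and K are hypothetical, and the sketch assumes $\tau \le n$.

```python
import math


def worst_case_bound(tau, n, K):
    """Worst-case regret bound 2 * tau / n + K^(3/2) / sqrt(tau)."""
    return 2.0 * tau / n + K ** 1.5 / math.sqrt(tau)


def tuned_tau(n, K):
    """Exploration length obtained by setting the derivative to zero."""
    return K * (n / 4.0) ** (2.0 / 3.0)


n, K = 10000, 5
tau = tuned_tau(n, K)
print(tau, worst_case_bound(tau, n, K), 3.0 * K / n ** (1.0 / 3.0))
print(worst_case_bound(0.9 * tau, n, K), worst_case_bound(1.1 * tau, n, K))
```

At the tuned value the bound equals $(2 \cdot 4^{-2/3} + 4^{1/3})\,K/n^{1/3} \approx 2.38\,K/n^{1/3}$, which is consistent with the final bound $3K/n^{1/3}$.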
D Additional Simulations

D.1 Comparison between MV-LCB and ExpExp with K = 2

Figure 3: Regret $\mathcal{R}_n$ of MV-LCB.

We consider the variance minimization problem ($\rho = 0$) with $K = 2$ Gaussian arms with different
means and variances. In particular, we consider a grid of values with $\mu_1 = 1.5$, $\mu_2 \in [0.4; 1.5]$,
$\sigma_1^2 \in [0.0; 0.25]$, and $\sigma_2^2 = 0.25$, so that $\Delta \in [0.0; 0.25]$ and $\Gamma \in [0.0; 1.1]$, and number of rounds
$n \in [50; 2.5 \times 10^5]$. Figures 3 and 4 report the mean regret for different values of n. The colors are
renormalized in each plot so that dark blue corresponds to the smallest regret and red to the largest
regret. The results confirm the theoretical findings of Theorems 1 and 2. In fact, for simple problems
(large gaps $\Delta$) MV-LCB converges to zero regret faster than ExpExp, while for $\Delta$ close to zero
(i.e., equivalent arms), MV-LCB has a constant regret which does not decrease with n and the regret
of ExpExp slowly decreases to zero.

D.2 Risk tolerance sensitivity 

In Section 5 we report numerical results demonstrating the composition of the regret and the
performance of the algorithms with only 2 arms in the case of variance minimization. Here we report results
for a wide range of risk tolerance $\rho \in [0.0; 10.0]$ and $K = 15$ arms. We set the mean and variance
for each of the 15 arms so that a subset of arms is always dominated (i.e., for any $\rho$, $\mathrm{MV}_i^{\rho} > \mathrm{MV}_{i^*}^{\rho}$),
demonstrating the effect of different $\rho$ values on the position of the optimal arm $i^*$.
Figure 6: Risk tolerance sensitivity of MV-LCB and ExpExp for Configuration 1. (Twelve panels, each titled
"Efficient Frontier of Risk-Aversion", for n = 50, 100, 250, 500, 1000, 2500, 5000, 10000, 25000, 50000,
100000, and 250000.)

In Figure 2 we arranged the true values of each arm along the red frontier and the $\rho$-directed
performance of the algorithms in a standard deviation-mean plot. The green and blue lines show the
standard deviation and mean for the performance of each algorithm for a specific $\rho$ setting and
finite time n, where each point represents the resulting mean-standard deviation of the sequence of
pulls on the arms by the algorithm with that specific value of $\rho$. The gap between the $\rho$-specific
performance of the algorithm and the corresponding optimal arm along the red frontier represents
the regret for the specific $\rho$ value. Accordingly, the gap between the algorithm performance curves
represents the gap in performance between MV-LCB and ExpExp. Where many arms
have big gaps (e.g., all the dominated arms have a large gap for any value of $\rho$), MV-LCB tends to
perform better than ExpExp. The series of plots represents increasing values of n and demonstrates
the relative algorithm performance versus the optimal red frontier. The set of plots represents the
two settings reported in Figure 5. We chose the values of the arms so as to have configurations with
different complexities. In particular, configuration 1 corresponds to "easy" problems for MV-LCB
since the arms all have quite large gaps (for different values of $\rho$) and this should allow it to perform
well. On the other hand, the second configuration has much smaller gaps and, thus, higher
complexity. According to the bounds for MV-LCB we know that a good proxy for its learning complexity is
represented by the term p . In Figurepjwe report such complexity for different values of p 

and, as can be seen, configuration 2 always has a higher complexity than configuration 1.
Figure 7: Risk tolerance sensitivity of MV-LCB and ExpExp for Configuration 2. (Twelve panels, each titled
"Efficient Frontier of Risk-Aversion", for n = 50, 100, 250, 500, 1000, 2500, 5000, 10000, 25000, 50000,
100000, and 250000.)

As we notice, in both configurations the performance of MV-LCB and ExpExp approaches that of one of
the optimal arms $i^*$ for each specific $\rho$ as n increases. Nonetheless, in configuration 1 the large
number of suboptimal arms (e.g., arms with large gaps) allows MV-LCB to outperform ExpExp
and converge faster to the optimal arm (and thus zero regret). On the other hand, in configuration 2
there are more arms with similar performance and for some values of $\rho$ ExpExp eventually achieves
better performance than MV-LCB.